DeepSeek V4 #24162

Open
am17an wants to merge 44 commits into ggml-org:master from am17an:dsv4

Conversation

@am17an

am17an commented Jun 5, 2026

Contributor

Overview

Still a WIP, with lots of work to do before this is usable. At the current stage it passes long-context and tool-calling tests but is quite slow. All the complexity is in the new llama-kv-cache-dsv4 and the deepseekv4 model class; there are no new ggml ops at the moment.

To run the Flash version you need at least 100 GB of VRAM (you can use antirez's GGUF or use this PR to convert one); for the full Flash version, 160+ GB. Here's how I was running the server on a DGX Spark:

llama-server -m dsv4-q2_k.gguf -fa 0 -c 32768 --jinja --chat-template-file models/templates/deepseek-ai-DeepSeek-V4.jinja --fit off

Note that it is extremely slow at the moment (~4-5 tok/s).

Thanks to @pwilkin for the correct chat template + debugging help
Thanks to @fairydreaming for his help in debugging + contributing fixes

Additional information

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, paired with both codex and claude.

@github-actions (bot) added the model, python, and ggml labels on Jun 5, 2026
@fairydreaming

Collaborator

@am17an I wonder what's the purpose of f32 casts and conts after mulmats here?

diff --git a/src/models/deepseek-v4.cpp b/src/models/deepseek-v4.cpp
index da3536f37..c8e17ef4e 100644
--- a/src/models/deepseek-v4.cpp
+++ b/src/models/deepseek-v4.cpp
@@ -828,11 +828,9 @@ ggml_tensor * llama_model_deepseek_v4_flash::graph::build_attention(
     ggml_tensor * hca_state_score = nullptr;
     if (ratio == DSV4_HCA_RATIO && inp_dsv4->get_hca().state_idxs) {
         hca_state_kv = build_lora_mm(layer.attn_comp_wkv, cur);
-        hca_state_kv = ggml_cont(ctx0, ggml_cast(ctx0, hca_state_kv, GGML_TYPE_F32));
         cb(hca_state_kv, "hca_state_kv", il);
 
         hca_state_score = build_lora_mm(layer.attn_comp_wgate, cur);
-        hca_state_score = ggml_cont(ctx0, ggml_cast(ctx0, hca_state_score, GGML_TYPE_F32));
         cb(hca_state_score, "hca_state_score", il);
 
         ggml_tensor * ape = layer.attn_comp_ape;
@@ -848,11 +846,9 @@ ggml_tensor * llama_model_deepseek_v4_flash::graph::build_attention(
 
     if (ratio == DSV4_CSA_RATIO && inp_dsv4->get_csa().state_idxs) {
         ggml_tensor * csa_state_kv = build_lora_mm(layer.attn_comp_wkv, cur);
-        csa_state_kv = ggml_cont(ctx0, ggml_cast(ctx0, csa_state_kv, GGML_TYPE_F32));
         cb(csa_state_kv, "csa_state_kv", il);
 
         ggml_tensor * csa_state_score = build_lora_mm(layer.attn_comp_wgate, cur);
-        csa_state_score = ggml_cont(ctx0, ggml_cast(ctx0, csa_state_score, GGML_TYPE_F32));
         cb(csa_state_score, "csa_state_score", il);
 
         ggml_tensor * csa_ape = layer.attn_comp_ape;
@@ -902,11 +898,9 @@ ggml_tensor * llama_model_deepseek_v4_flash::graph::build_attention(
         ggml_build_forward_expand(gf, csa_state_score);
 
         ggml_tensor * lid_state_kv = build_lora_mm(layer.indexer_comp_wkv, cur);
-        lid_state_kv = ggml_cont(ctx0, ggml_cast(ctx0, lid_state_kv, GGML_TYPE_F32));
         cb(lid_state_kv, "lid_state_kv", il);
 
         ggml_tensor * lid_state_score = build_lora_mm(layer.indexer_comp_wgate, cur);
-        lid_state_score = ggml_cont(ctx0, ggml_cast(ctx0, lid_state_score, GGML_TYPE_F32));
         cb(lid_state_score, "lid_state_score", il);
 
         ggml_tensor * lid_ape = layer.indexer_comp_ape;

Removed them and got the same logits.

@am17an

am17an commented Jun 5, 2026

Contributor Author

@fairydreaming it's an artifact of debugging, you can push your changes to this branch (I added you as collaborator)

@github-actions (bot) added the script and testing labels on Jun 5, 2026
Comment thread scripts/gen-chat-inline-templates.py Outdated
@fairydreaming

fairydreaming commented Jun 8, 2026

Collaborator

Played with flash attention this weekend, here's my experimental patch:

diff --git a/src/llama-kv-cache-dsv4.cpp b/src/llama-kv-cache-dsv4.cpp
index 1737d62ae..82ab5f01f 100644
--- a/src/llama-kv-cache-dsv4.cpp
+++ b/src/llama-kv-cache-dsv4.cpp
@@ -323,6 +323,8 @@ static llama_kv_cache_dsv4_context::comp_plan dsv4_build_comp_plan(
         }
     }
 
+    plan.n_kv = GGML_PAD(plan.n_kv, 256u);
+
     if (overlap) {
         // [ all blocks' prev-window indices | all blocks' cur-window indices ]
         plan.state_read_idxs.reserve(overlap_prev_reads.size() + overlap_cur_reads.size());
@@ -686,7 +688,7 @@ llama_kv_cache_dsv4::llama_kv_cache_dsv4(
 
     kv_csa = std::make_unique<llama_kv_cache>(
             model, hparams_csa, type_k, type_v,
-            v_trans, offload, unified, dsv4_comp_size(kv_size, DSV4_CSA_RATIO), n_seq_max, n_pad,
+            v_trans, offload, unified, GGML_PAD(dsv4_comp_size(kv_size, DSV4_CSA_RATIO), 256u), n_seq_max, n_pad,
             0, LLAMA_SWA_TYPE_NONE, filter_csa, nullptr);
 
     LLAMA_LOG_INFO("%s: creating DSV4 HCA compressed KV cache, size = %u cells\n",
@@ -694,7 +696,7 @@ llama_kv_cache_dsv4::llama_kv_cache_dsv4(
 
     kv_hca = std::make_unique<llama_kv_cache>(
             model, hparams_hca, type_k, type_v,
-            v_trans, offload, unified, dsv4_comp_size(kv_size, DSV4_HCA_RATIO), n_seq_max, n_pad,
+            v_trans, offload, unified, GGML_PAD(dsv4_comp_size(kv_size, DSV4_HCA_RATIO), 256u), n_seq_max, n_pad,
             0, LLAMA_SWA_TYPE_NONE, filter_hca, nullptr);
 
     LLAMA_LOG_INFO("%s: creating DSV4 lightning-indexer KV cache, size = %u cells\n",
@@ -702,7 +704,7 @@ llama_kv_cache_dsv4::llama_kv_cache_dsv4(
 
     kv_lid = std::make_unique<llama_kv_cache>(
             model, hparams_lid, type_k, type_v,
-            v_trans, offload, unified, dsv4_comp_size(kv_size, DSV4_CSA_RATIO), n_seq_max, n_pad,
+            v_trans, offload, unified, GGML_PAD(dsv4_comp_size(kv_size, DSV4_CSA_RATIO), 256u), n_seq_max, n_pad,
             0, LLAMA_SWA_TYPE_NONE, filter_csa, nullptr);
 
     LLAMA_LOG_INFO("%s: creating DSV4 CSA compressor state\n", __func__);
diff --git a/src/models/deepseek-v4.cpp b/src/models/deepseek-v4.cpp
index 3f3b0cf92..7bde3bcff 100644
--- a/src/models/deepseek-v4.cpp
+++ b/src/models/deepseek-v4.cpp
@@ -683,6 +683,10 @@ ggml_tensor * llama_model_deepseek_v4_flash::graph::build_csa_lid_attention(
     ggml_tensor * kq_mask = ggml_concat(ctx0, raw_mask, csa_mask, 0);
     cb(kq_mask, "csa_lid_kq_mask", il);
 
+    if (cparams.flash_attn && kq_mask->type != GGML_TYPE_F16) {
+        kq_mask = ggml_cast(ctx0, kq_mask, GGML_TYPE_F16);
+    }
+
     ggml_tensor * out = build_attn_mha(q, k_all, k_all, nullptr, kq_mask, sinks, nullptr, kq_scale, il);
     cb(out, "attn_csa_lid", il);
 
@@ -740,6 +744,10 @@ ggml_tensor * llama_model_deepseek_v4_flash::graph::build_hca_attention(
     ggml_tensor * kq_mask = ggml_concat(ctx0, raw_mask, hca_mask, 0);
     cb(kq_mask, "hca_kq_mask", il);
 
+    if (cparams.flash_attn && kq_mask->type != GGML_TYPE_F16) {
+        kq_mask = ggml_cast(ctx0, kq_mask, GGML_TYPE_F16);
+    }
+
     ggml_tensor * out = build_attn_mha(q, k_all, k_all, nullptr, kq_mask, sinks, nullptr, kq_scale, il);
     cb(out, "attn_hca", il);
 

With FA enabled and the lightning indexer GGML op added, compute buffer memory usage got really low. I think processing 1M tokens is achievable on a single RTX PRO 6000 Max-Q with CPU expert offloading (f16 cache type), even with an 8k ubatch size.
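
To make the padding in the patch above concrete: each compressed cache size (and plan.n_kv) is rounded up to a multiple of 256 cells, presumably because the flash-attention path expects that alignment. Below is a minimal sketch, assuming the usual power-of-two round-up that ggml's GGML_PAD macro performs. The names pad_to, comp_size and padded_comp_size are hypothetical, and comp_size is only a guess (ceiling division) at what dsv4_comp_size does, since its definition is not shown in this thread.

```cpp
#include <cstdint>

// Round x up to the next multiple of n. n must be a power of two.
// This mirrors the arithmetic of ggml's GGML_PAD macro.
constexpr uint32_t pad_to(uint32_t x, uint32_t n) {
    return (x + n - 1) & ~(n - 1);
}

// Hypothetical stand-in for dsv4_comp_size(): number of compressed cells
// needed to cover kv_size raw cells at a given compression ratio
// (assumed to be a ceiling division; the real helper may differ).
constexpr uint32_t comp_size(uint32_t kv_size, uint32_t ratio) {
    return (kv_size + ratio - 1) / ratio;
}

// Compressed cache size after padding to a 256-cell boundary,
// as done for kv_csa, kv_hca and kv_lid in the patch.
constexpr uint32_t padded_comp_size(uint32_t kv_size, uint32_t ratio) {
    return pad_to(comp_size(kv_size, ratio), 256u);
}
```

With these assumptions, a size that is already a multiple of 256 is left unchanged, and anything else grows by at most 255 cells, so the memory cost of the padding is small compared to the cache itself.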

Some performance numbers (Epyc 9374F + RTX PRO 6000 Max-Q):

$ ./bin/llama-batched-bench -m ../models/DeepSeek-V4-Flash.gguf -b 8192 -ub 8192 -npl 1 -npp 8192,16384,32768,65536,131072,262144,524288 -ntg 32 -fa 1 -cmoe --no-repack
0.00.471.019 W llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance

llama_batched_bench: n_kv_max = 1048576, n_batch = 8192, n_ubatch = 8192, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |   10.947 |   748.36 |    2.981 |    10.74 |   13.927 |   590.49 |
| 16384 |     32 |    1 |  16416 |   22.323 |   733.96 |    2.817 |    11.36 |   25.140 |   652.98 |
| 32768 |     32 |    1 |  32800 |   46.820 |   699.87 |    2.878 |    11.12 |   49.699 |   659.98 |
| 65536 |     32 |    1 |  65568 |  102.828 |   637.34 |    2.962 |    10.80 |  105.790 |   619.79 |
|131072 |     32 |    1 | 131104 |  240.695 |   544.56 |    3.140 |    10.19 |  243.835 |   537.68 |
|262144 |     32 |    1 | 262176 |  624.131 |   420.01 |    3.503 |     9.14 |  627.634 |   417.72 |
|524288 |     32 |    1 | 524320 | 1860.555 |   281.79 |    4.218 |     7.59 | 1864.773 |   281.17 |

49.09.116.580 W ~llama_context:      CUDA0 compute buffer size of 24476.1461 MiB, does not match expectation of 4168.0000 MiB
49.09.116.584 W ~llama_context:  CUDA_Host compute buffer size of 16900.4862 MiB, does not match expectation of 16772.1562 MiB

Max memory usage I saw in nvidia-smi was 60836MiB / 97887MiB.

Edit: I forgot about the Pro benchmark results. I aborted the run in the middle, but it got to:

$ ./bin/llama-batched-bench -m ../../llama.cpp-dsv4/models/DeepSeek-V4-Pro.gguf -b 8192 -ub 8192 -npl 1 -npp 8192,16384,32768,65536,131072,262144,524288 -ntg 32 -fa 1 -cmoe --no-repack
0.00.497.037 W llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance

llama_batched_bench: n_kv_max = 1048576, n_batch = 8192, n_ubatch = 8192, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |   45.867 |   178.61 |    5.459 |     5.86 |   51.325 |   160.23 |
| 16384 |     32 |    1 |  16416 |   93.140 |   175.91 |    5.276 |     6.07 |   98.416 |   166.80 |
| 32768 |     32 |    1 |  32800 |  191.703 |   170.93 |    5.373 |     5.96 |  197.076 |   166.43 |
| 65536 |     32 |    1 |  65568 |  402.959 |   162.64 |    5.543 |     5.77 |  408.502 |   160.51 |
|131072 |     32 |    1 | 131104 |  883.115 |   148.42 |    5.860 |     5.46 |  888.975 |   147.48 |

@fairydreaming

fairydreaming commented Jun 8, 2026

Collaborator

@am17an Any specific reason you went with DEEPSEEK_V4_FLASH/deepseek-v4-flash/deepseek_v4_flash when naming things instead of simply DEEPSEEK4/deepseek4/deepseek4? I mean this convention is a bit inconsistent with existing names and the flash part is confusing (sounds like flash only while pro uses this architecture too), maybe it would be better to change it now before it spreads? (I noticed that even the architecture name in GGUF is deepseek-v4-flash, so we'd have to update it in existing GGUF files or reconvert).

@am17an

am17an commented Jun 10, 2026

Contributor Author

I'm going to work on making graph reuse work across various compression boundaries and also make multi-sequence work, along with fixing a couple of issues. After that I think a round of simple optimization + running some evals and then this should be ready for review.

Since it's a large PR it may make sense to separate out conversion, chat and then the model into separate PRs. In parallel #24231 + FA can be included when they're ready

@fairydreaming

Collaborator

@am17an Sounds good, I stared at tensor values for the last few days comparing them with the DeepSeek inference code but haven't found any obvious problems.

@fairydreaming

fairydreaming commented Jun 11, 2026

Collaborator

For anyone interested I have this PR with various optimizations (#24231+CUDA, #24011, FA changes) in my repo: https://github.com/fairydreaming/llama.cpp/tree/dsv4

PP is the same as reported above, TG is ~70% faster.

Comment thread common/chat.cpp Outdated
@rujialiu

For anyone interested I have this PR with various optimizations (#24231+CUDA, #24011, FA changes) in my repo: https://github.com/fairydreaming/llama.cpp/tree/dsv4

Thanks! I tried but failed. It looks like antirez's gguf is not yet supported?

0.00.117.161 I srv    load_model: loading model '\gguf\DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf'
0.00.230.174 E llama_model_load: error loading model: error loading model hyperparameters: key not found in model: deepseek4.swiglu_clamp_shexp
0.00.230.184 E llama_model_load_from_file_impl: failed to load model

@fairydreaming

Collaborator

@rujialiu Unfortunately there are multiple naming differences for model parameters and tensors that prevent antirez GGUFs from working with this PR.

@fairydreaming

Collaborator

@am17an On the other hand, maybe it's a good idea to unify the naming with antirez GGUFs? From what I see in the files, there's only a single difference in tensor shapes, in the attention output tensor: [4096, 1024, 8] vs [4096, 8192, 1]. I can try to fix this in the meantime, what do you think?

@rujialiu

@rujialiu Unfortunately there are multiple naming differences for model parameters and tensors that prevent antirez GGUFs from working with this PR.

Thanks for the reply. I'm especially interested in trying this REAP version in antirez's format, which (hopefully) is small enough for lower-end machines with only 64GB RAM:
https://www.modelscope.cn/models/0xSero/DeepSeek-V4-Flash-162B-GGUF

@am17an

am17an commented Jun 12, 2026

Contributor Author

@fairydreaming sure, I think it makes sense to support already existing GGUFs. BTW can you check the latest commit for any perf improvements on your setup? Graph reuse was added across CSA boundaries

@fairydreaming

Collaborator

@am17an Merged the changes and I see an improvement, TG in Flash now exceeds 20 t/s for short prompts (was around 18):

$ ./bin/llama-batched-bench -m ../../llama.cpp-dsv4/models/DeepSeek-V4-Flash.gguf -b 8192 -ub 8192 -npl 1 -npp 8192,16384,32768,65536,131072,262144,524288,1048064 -ntg 128 -fa 1 -cmoe --no-repack
0.00.464.041 W llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance

llama_batched_bench: n_kv_max = 1048576, n_batch = 8192, n_ubatch = 8192, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |    128 |    1 |   8320 |   10.998 |   744.89 |    6.324 |    20.24 |   17.321 |   480.33 |
| 16384 |    128 |    1 |  16512 |   22.388 |   731.82 |    6.224 |    20.57 |   28.612 |   577.11 |
| 32768 |    128 |    1 |  32896 |   47.058 |   696.33 |    6.393 |    20.02 |   53.451 |   615.44 |
| 65536 |    128 |    1 |  65664 |  103.147 |   635.36 |    6.553 |    19.53 |  109.700 |   598.58 |
...

@fairydreaming

Collaborator

@rujialiu OK, this is weird. I made a patch that allows antirez DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf that I downloaded some time ago to work in this PR, but your DeepSeek-V4-Flash-Spark-Mini-Q2-REAP-ds4.gguf for some reason causes CUDA error: an illegal memory access was encountered. The only difference between them is the number of experts, so it's extra weird. Still investigating.

@Lowkey-Loki-SN

Lowkey-Loki-SN commented Jun 12, 2026

Not sure if it's too early for this but I'm noticing a consistently reproducible issue where the model outputs malformed JSX tags during long responses as follows:

return ( <
    section className = "hero"
    ref = { heroRef } >
    <
    span className = "hero__label label hero__animate" > Lumina < /span> {

Happens with both the raw unquantized Q8 GGUF and the quantized Q3 GGUF that I normally use but isn't reproducible with responses over the web/API.

Doesn't happen with short responses.

Repo used:

https://github.com/fairydreaming/llama.cpp, ds4 branch
Commit hash: abd1bee

Command used for HF -> GGUF:

python3 convert_hf_to_gguf.py \
  ../Models/HF/DeepSeek-V4-Flash/ \
  --outfile ~/AI/Models/GGUFs/DeepSeek-V4-Flash.gguf \
  --outtype q8_0 \
  --fp8-as-q8 \
  --use-temp-file

Command used for Quantization:

cat > "dsv4-flash-q3-robust.tensortypes" <<'EOF'
^blk\.[0-2]\.ffn_(gate|up|down)_exps\.weight$=mxfp4
^blk\.(3|42)\.ffn_down_exps\.weight$=mxfp4
ffn_down_exps=q3_K
ffn_gate_exps=q3_K
ffn_up_exps=q3_K
^token_embd\.weight$=q8_0
^output\.weight$=q8_0
indexer\.attn_q_b=q8_0
indexer=bf16
attn_comp=bf16
attn=q8_0
shexp=q8_0
nextn=q8_0
EOF

build/bin/llama-quantize \
  --allow-requantize \
  --tensor-type-file dsv4-flash-q3-robust.tensortypes \
  ../Models/GGUFs/DeepSeek-V4-Flash.gguf \
  ../Models/GGUFs/dsv4-flash-q3.gguf \
  Q3_K_S

Launch command:

CUDA_VISIBLE_DEVICES=1,0 build/bin/llama-server -m ~/AI/Models/GGUFs/dsv4-flash-q3.gguf -c 200000 -ngl 99 -fa 1 --jinja -np 1 --chat-template-file models/templates/deepseek-ai-DeepSeek-V4.jinja --no-mmap -ot ".ffn_(up|down)_exps.=CPU","([3-7]+).ffn_.*_exps.=CPU" -ts 0.46,0.54 --port 1234 -b 2048 -ub 2048

Prompt used for this:

Create a single-file HTML website (index.html) that serves as a high-end SaaS landing page for a next-generation developer product. Format: Single HTML file, No build tools, runs directly in the browser. Libraries: React 18, Babel Standalone, GSAP (Core + ScrollTrigger). Styling: Hand-written CSS only, No Tailwind. Design: Soft off-white background, near-black text, one accent colour. Modern sans-serif body, expressive display font, Tone: Editorial, minimal, no flashy effects, no glassmorphism. Layout: Full height hero, scroll-driven typography transformation, asymmetrical product philosophy grid. Signature Animation: One unforgettable visual movement (eg. system spanning into alignment). Final CTA. All motiopn must be scroll-linked

My Setup:

2x RTX 3080 20GB
Xeon 6148
128GB DDR4 2666hz

@rujialiu

@rujialiu OK, this is weird. I made a patch that allows antirez DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf that I downloaded some time ago to work in this PR, but your DeepSeek-V4-Flash-Spark-Mini-Q2-REAP-ds4.gguf for some reason causes CUDA error: an illegal memory access was encountered. The only difference between them is the number of experts, so it's extra weird. Still investigating.

@fairydreaming Thanks! I tried that REAP version with cchuter's branch i.e. https://github.com/cchuter/llama.cpp/tree/feat/v4-port-cuda which works with that DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf on my machine (tg ~4 tok/s, pp even slower). I also got:

CUDA error: an illegal memory access was encountered
D:\llama.cpp-cchuter\ggml\src\ggml-cuda\ggml-cuda.cu:108: CUDA error
  current device: 0, in function ggml_backend_cuda_synchronize at D:\llama.cpp-cchuter\ggml\src\ggml-cuda\ggml-cuda.cu:3327

I can't check whether this REAP gguf works with antirez's ds4 because ds4 doesn't support native Windows. I had good experience running Minimax 2.5 REAP with llama.cpp, but I don't have any way to ensure that gguf is sane (or at least works with official ds4). Sorry about that.

@rujialiu

@fairydreaming OK, I found that cchuter's branch works with that REAP gguf (actually I tried a slightly larger 180B REAP gguf instead) with --device none to force CPU backend. I tried some non-trivial prompts and the output looks good. So probably the REAP gguf is good and the issue is caused somewhere outside this PR (because cchuter's branch also suffers from the same issue)

@am17an

am17an commented Jun 13, 2026

Contributor Author

@Lowkey-Loki-SN I think it has something to do with tokenization; it messes up even small JSX templates for me. Mostly extra whitespace.

@Lowkey-Loki-SN

@Lowkey-Loki-SN I think it has something to do with tokenization; it messes up even small JSX templates for me. Mostly extra whitespace.

Glad to hear it's reproducible on your end too! And yes, it is always either extra whitespace or newlines when it happens on my end

@fairydreaming

fairydreaming commented Jun 13, 2026

Collaborator

@rujialiu From what I see, the problem is that the expert indices read from the tid2eid tensors during quantize_mmq_q8_1() (which is called inside ggml_cuda_mul_mat_id()) are wrong: they should be from 0 to 143, but I see large numbers there. I checked these tensors in the GGUF file with hexdump and found only values from 0 to 143 inside, so the GGUF seems to be OK. Perhaps there's some tensor data memory corruption going on. I have a small reproducible example of this error; it works fine with ubatch 31 but fails with ubatch 32.

Edit: @am17an is right, I disabled expert offloading (so everything is on CUDA now) and now it works with ubatch 8 but fails with ubatch 9.

@am17an

am17an commented Jun 13, 2026

Contributor Author

@fairydreaming ubatch 32 is when the offload would kick in, so probably something in cuda backend

@ggerganov

Member

Btw, I've noticed that sometimes the response goes inside the reasoning block:

[Image]

Is this expected?

This continues to happen with the latest version.

@am17an

am17an commented Jun 28, 2026

Contributor Author

This continues to happen with the latest version.

What's your command to launch the server?

@ggerganov

Member

This continues to happen with the latest version.

What's your command to launch the server?

make -j && ./bin/llama-server -m ./DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf --port 8014 -c 65536 --host 0.0.0.0 -lv 4

Comment thread src/llama-kv-cache-dsv4.h Outdated
Comment on lines +1192 to +1201
if (p0 > 0) {
    // DSV4 compressed cache rows are derived from running compressor state,
    // so arbitrary rollback is not reconstructible from the raw cache alone.
    // Allow the common prompt-cache cleanup no-op: remove [end, infinity).
    if (seq_id >= 0 && p0 > kv_raw->seq_pos_max(seq_id)) {
        return true;
    }

    return false;
}

Member

Without partial sequence removal, are we going to be able to support MTP?

@am17an am17an Jun 28, 2026

Contributor Author

We can still use a checkpoint and do MTP=1. The current partial state is just ~17 Mb, so it should be possible to do something similar to what we do in Qwen for MTP > 1.
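
For readers following the thread, the guard quoted above can be modelled in isolation. This is a simplified sketch, not the PR's code: can_remove_tail and pos_t are hypothetical names, seq_pos_max stands in for kv_raw->seq_pos_max(seq_id), and the p0 <= 0 branch (assumed here to mean "remove the whole sequence", which needs no rollback) is an assumption because that part of the function is not shown in the review snippet.

```cpp
#include <cstdint>

using pos_t = int32_t; // hypothetical alias for a token position

// Returns true if removing positions [p0, infinity) of a sequence can be
// accepted without rolling back compressed state.
// seq_pos_max: largest position currently stored for the sequence in the
// raw cache.
bool can_remove_tail(int32_t seq_id, pos_t p0, pos_t seq_pos_max) {
    if (p0 > 0) {
        // A range that starts past the last stored position removes nothing,
        // so it is a safe no-op. Any other partial removal would require
        // reconstructing compressor state and is rejected.
        return seq_id >= 0 && p0 > seq_pos_max;
    }
    // Assumption: p0 <= 0 clears the sequence from the start, so there is
    // no partial state to reconstruct.
    return true;
}
```

Under this model, a request such as "drop everything from position 50" on a sequence holding positions 0..100 is refused, which is exactly the partial removal that multi-token MTP would need.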

Comment on lines +1131 to +1163
// When either raw or compressed state is per-sequence, split ubatches so
// every token maps cleanly to its stream. This may serialize independent
// non-unified sequences, but keeps compressed state ownership explicit.
do {
    balloc.split_reset();

    std::vector<llama_ubatch> ubatches;
    while (true) {
        llama_ubatch ubatch;
        if (comp_coupled_same_set) {
            ubatch = balloc.split_equal(n_ubatch, false);
        } else if (comp_coupled) {
            ubatch = balloc.split_seq(1);
        } else if (comp_per_seq) {
            ubatch = balloc.split_seq(n_ubatch);
        } else {
            ubatch = balloc.split_equal(n_ubatch, raw_per_seq);
        }

        if (ubatch.n_tokens == 0) {
            break;
        }
        ubatches.push_back(std::move(ubatch)); // NOLINT
    }

    if (balloc.get_n_used() < balloc.get_n_tokens()) {
        break;
    }

    if (auto ctx = make_context(std::move(ubatches))) {
        return ctx;
    }
} while (false);

Member

I'm looking at the multi-sequence change (e16065f) and it seems that it does not accomplish the goal of properly supporting the non-unified KV cache. For context about how the non-unified KV cache should work, see #14363. In short, it requires ubatches with equal sequence lengths (i.e. split_equal).

However, the implemented logic always does split_seq. This is the correct thing to do when the non-unified KV cache is not supported by the graph. The idea is that when we use split_seq, we guarantee that each ubatch will only have tokens from a single sequence, so the graph does not need to handle multiple streams. For example, we do the same thing with the recurrent cache when using rollbacks, because the non-unified cache is currently not supported there either:

if (n_rs_seq > 0) {
    // [TAG_RECURRENT_ROLLBACK_SPLITS]
    // TODO: recurrent state rollback does not support equal splits
    ubatch = balloc.split_seq(n_ubatch);
} else {

This is a workaround, not the proper solution. If my understanding is correct, a lot of the new logic added in that commit is not necessary because, in the end, we still end up using split_seq instead of the desired split_equal. Therefore we can simply work around this by using split_seq, similar to the recurrent cache above, and avoid the extra logic.

In the future, we have to rework non-unified KV cache to be properly supported. I'm planning to do it for the recurrent memory first, so that Qwen3.6 runs faster with parallel sequences. For DS4 I was hoping we can start with the correct implementation from the beginning. But if it is too complicated, we can try to do it later.
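
To illustrate the difference between the two splitting strategies discussed in this review, here is a toy model. It is not llama.cpp's batch allocator: split_by_sequence and split_equal_lengths are hypothetical functions that only track how many tokens each ubatch takes from each sequence, and the real split_seq / split_equal have more rules than shown here.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy ubatch: entry s is the number of tokens taken from sequence s.
using ubatch = std::vector<size_t>;

// split_seq-style: every ubatch holds tokens from exactly one sequence,
// at most n_ubatch tokens at a time.
std::vector<ubatch> split_by_sequence(std::vector<size_t> remaining, size_t n_ubatch) {
    n_ubatch = std::max<size_t>(n_ubatch, 1); // avoid an endless loop on 0
    std::vector<ubatch> out;
    for (size_t s = 0; s < remaining.size(); ++s) {
        while (remaining[s] > 0) {
            ubatch ub(remaining.size(), 0);
            ub[s] = std::min(remaining[s], n_ubatch);
            remaining[s] -= ub[s];
            out.push_back(ub);
        }
    }
    return out;
}

// split_equal-style: every ubatch takes the same number of tokens from each
// sequence that still has tokens left, so all streams advance in lockstep.
std::vector<ubatch> split_equal_lengths(std::vector<size_t> remaining, size_t n_ubatch) {
    std::vector<ubatch> out;
    while (true) {
        size_t n_active = 0;
        size_t min_left = 0;
        for (size_t r : remaining) {
            if (r > 0) {
                min_left = (n_active == 0) ? r : std::min(min_left, r);
                ++n_active;
            }
        }
        if (n_active == 0) {
            break;
        }
        // Equal share per active sequence: bounded by the shortest remaining
        // sequence and by the token budget, but always at least one token
        // (so this toy can exceed the budget when n_ubatch < n_active).
        const size_t per_seq = std::max<size_t>(1, std::min(min_left, n_ubatch / n_active));
        ubatch ub(remaining.size(), 0);
        for (size_t s = 0; s < remaining.size(); ++s) {
            if (remaining[s] > 0) {
                ub[s] = per_seq;
                remaining[s] -= per_seq;
            }
        }
        out.push_back(ub);
    }
    return out;
}
```

In this model, the sequence-wise split never mixes streams in one ubatch, which is why a graph that cannot handle multiple streams can fall back on it, while the equal-length split is the shape a properly supported non-unified cache would consume.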
